Training Data Cleaning for Text Classification

نویسندگان

  • Andrea Esuli
  • Fabrizio Sebastiani
چکیده

In text classification (TC) and other tasks involving supervised learning, labelled data may be scarce or expensive to obtain; strategies are thus needed for maximizing the effectiveness of the resulting classifiers while minimizing the required amount of training effort. Training data cleaning (TDC) consists in devising ranking functions that sort the original training examples in terms of how likely it is that the human annotator has misclassified them, thereby providing a convenient means for the human annotator to revise the training set so as to improve its quality. Working in the context of boosting-based learning methods we present three different techniques for performing TDC and, on two widely used TC benchmarks, evaluate them by their capability of spotting misclassified texts purposefully inserted in the training set.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Using Fuzzy LR Numbers in Bayesian Text Classifier for Classifying Persian Text Documents

Text Classification is an important research field in information retrieval and text mining. The main task in text classification is to assign text documents in predefined categories based on documents’ contents and labeled-training samples. Since word detection is a difficult and time consuming task in Persian language, Bayesian text classifier is an appropriate approach to deal with different...

متن کامل

Using Fuzzy LR Numbers in Bayesian Text Classifier for Classifying Persian Text Documents

Text Classification is an important research field in information retrieval and text mining. The main task in text classification is to assign text documents in predefined categories based on documents’ contents and labeled-training samples. Since word detection is a difficult and time consuming task in Persian language, Bayesian text classifier is an appropriate approach to deal with different...

متن کامل

How Much Noise in Text is too Much: A Study in Automatic Document Classification

Noise is a stark reality in real life data. Especially in the domain of text analytics it has a significant impact as data cleaning forms a very large part (upto 80% time) of the data processing cycle. Noisy unstructured text is common in informal settings such as on-line chat, SMS, email, newsgroups and blogs, automatically transcribed text from speech data, and automatically recognized text f...

متن کامل

Exploiting Associations between Class Labels in Multi-label Classification

Multi-label classification has many applications in the text categorization, biology and medical diagnosis, in which multiple class labels can be assigned to each training instance simultaneously. As it is often the case that there are relationships between the labels, extracting the existing relationships between the labels and taking advantage of them during the training or prediction phases ...

متن کامل

Polish - English Speech Statistical Machine Translation Systems for the IWSLT 2014

This research explores effects of various training settings between Polish and English Statistical Machine Translation systems for spoken language. Various elements of the TED parallel text corpora for the IWSLT 2014 evaluation campaign were used as the basis for training of language models, and for development, tuning and testing of the translation system as well as Wikipedia based comparable ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2009